Power-law Null Model for Bystander Mutations in Cancer

Loes Olde Loohuis, Andreas Witzel, and Bud Mishra

Abstract 

In this paper we study Copy Number Variation (CNV) data. The underlying process generating CNV 
segments is generally assumed to be memory-less, giving rise to an exponential distribution of segment lengths. 
In this paper, we provide evidence from cancer patient data, which suggests that this generative model is 
too simplistic, and that segment lengths follow a power-law distribution instead. We conjecture a simple 
preferential attachment generative model that provides the basis for the observed power-law distribution. 
We then show how an existing statistical method for detecting cancer driver genes can be improved by 
incorporating the power-law distribution in the null model. 

Index Terms 

Copy Number Variation, Power-law Distribution, Generative Mechanism, Cancer Driver Genes Detection 

I. Introduction 

COMPREHENSIVE knowledge of the genomic aberrations that underlie cancer is of vital 
importance for diagnostics, prognostics, and the development of targeted therapies. Towards
this goal, large databases of genomic cancer-patient data have been generated in recent years. One type
of such data is Copy Number Variation (CNV) data. CNV is structural variation in which relatively 
large regions of the genome are either amplified or deleted, leading to gain- or loss-of-function of 
the genes contained in the affected regions.

CNV data consists of copy-number values of thousands of markers corresponding to different
locations in the genome. To reduce the noise in this data, sets of neighboring markers are often
combined, resulting in contiguous segments of equal copy number, which are classified as normal,
amplified, or deleted. Tools that perform this step, usually called 'segmenters,' include GLAD, CBS,
and a method developed by Mishra's group. The abnormal segments correspond to duplication
or deletion events and are used as input data to identify regions containing genes that are relevant
for the development of cancer (e.g., the method described in [9]).

The underlying process generating these CNV segments is generally assumed to be memory-less, 
giving rise to an exponential distribution of segment lengths. In this paper, we provide evidence from 
cancer patient data, which suggests that this generative model is too simplistic, and that segment 
lengths follow a power-law distribution instead. We conjecture a simple preferential attachment 
generative model that provides the basis for the observed power-law distribution. 

From a thorough understanding of the statistical properties of genomic copy-number data in cancer,
one expects to discover (either directly or indirectly) improved oncogenomics features, using statistical
inference tools that build upon more accurate null models. In this paper, we provide one such
improved estimator for an existing statistical method (due to Ionita et al. [9]) for detecting genetic
regions relevant to cancer, which we achieve by incorporating the power-law distribution in the null
model. We analyze three TCGA CNV data sets and show that the improved model based on the
power-law distribution outperforms the simpler null model, which only uses a non-informative prior.

December 31, 2013 

II. Evidence and Fitting 

We analyzed three CNV data sets from The Cancer Genome Atlas (TCGA): Lung Squamous Cell
Carcinoma (LUSC, 201 patients), Glioblastoma (GBM, 299 patients), and Ovarian Serous
Cystadenocarcinoma (OV, 337 patients).¹ The level 2 data was segmented using the segmentation
algorithm of Daruwala et al. [2], and the empirical segment-length distributions of amplifications and
deletions were fit to both power-law ($c x^{-\alpha}$) and exponential ($c e^{-\lambda x}$) distributions.

Figure 1 shows the segment length distribution and fitted functions for the deleted segments of
the OV dataset, and Table I lists the numerical values of all fits, as well as their $R^2$ goodness of fit.
Plots for the remaining data sets can be found in Figure 2 of Appendix A.

¹ http://cancergenome.nih.gov/ The datasets used are: LUSC HMS_HG-CGH-415K_G4124A, GBM HMS_HG-CGH-244A, and OV
HMS_HG-CGH-415K_G4124A.




[Figure 1: two panels; x-axis of each panel: Segment-length bin number.]

Fig. 1: Segment length distribution and fitted functions of deleted segments from the OV dataset.
The best power-law fit is shown on the left and the best exponential fit on the right. See Appendix A,
Figure 2 for the images showing the fits for all other data sets.





            best exponential fit        best power-law fit
            function        R^2         function              R^2
LUSC Amp    e^{-0.014x}     0.65        x^{-1.27}             0.86
LUSC Del    e^{-0.008x}     0.45        x^{-0.89}             0.79
OV Amp      e^{-0.014x}     0.67        x^{-1.39}             0.91
OV Del      e^{-0.013x}     0.64        x^{-1.30}             0.91
GBM Amp     e^{-0.015x}     0.39        x^{-[illegible]}      0.71
GBM Del     e^{-0.012x}     0.60        x^{-1.20}             0.78

TABLE I: Comparison of exponential and power-law fits for three TCGA data sets: LUSC, OV, and
GBM.



To determine threshold values for amplifications and deletions, we suitably modify the method
described in [8]: a segment is treated as an amplification (resp. a deletion) if its value is greater
(resp. smaller) than the mean plus (resp. minus) twice the standard deviation (AVG ± 2STD). The fit
was estimated by collecting the lengths of all segments above the amplification threshold value or
below the deletion threshold value and taking a histogram of the segment lengths. To make the fit
particularly sensitive to the tail of the distribution, we chose to fit the log of the data against
the log of the exponential and power-law distributions.
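
The thresholding and log-space fitting steps can be illustrated with a minimal sketch. It is not the authors' code: the function names, the use of ordinary least squares on the log of the histogram counts, and the use of the bin number as the x variable are assumptions made for illustration only.

```python
import numpy as np


def split_by_threshold(values, lengths, k=2.0):
    """Return (amplified_lengths, deleted_lengths) using the AVG +/- k*STD rule."""
    values = np.asarray(values, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    avg, std = values.mean(), values.std()
    return lengths[values > avg + k * std], lengths[values < avg - k * std]


def r_squared(y, y_hat):
    """Coefficient of determination of the prediction y_hat for the data y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_log_histogram(counts):
    """Fit c*x^(-alpha) and c*exp(-lambda*x) to histogram counts in log space.

    x is the 1-based bin number (an assumption); empty bins are skipped
    because their logarithm is undefined.
    """
    counts = np.asarray(counts, dtype=float)
    x = np.arange(1, len(counts) + 1, dtype=float)
    keep = counts > 0
    x, log_y = x[keep], np.log(counts[keep])

    # Power law: log y = log c - alpha * log x (a line in log-log space).
    slope_pl, icpt_pl = np.polyfit(np.log(x), log_y, 1)
    # Exponential: log y = log c - lambda * x (a line in semi-log space).
    slope_ex, icpt_ex = np.polyfit(x, log_y, 1)

    return {
        "alpha": -slope_pl,
        "r2_power_law": r_squared(log_y, slope_pl * np.log(x) + icpt_pl),
        "lambda": -slope_ex,
        "r2_exponential": r_squared(log_y, slope_ex * x + icpt_ex),
    }
```

In such a setup the counts would come from a histogram of the amplified (or deleted) segment lengths, for example `np.histogram(lengths, bins=50)[0]`, where the number of bins is again a free choice rather than a value taken from the paper.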

As shown in Table I, in all three datasets the power-law fits the segment-length distributions better
than the exponential one.

Several remarks about this result are due at this point. First, the remaining segments that are not
considered amplifications or deletions (the 'Normals') are not clearly power-law (nor exponentially)
distributed (see Appendix A, Table III for the actual fits, and Figure 3 for an illustrative figure). The
power-law distribution only appears to fit segments above (or below) a certain threshold. In Appendix
A we provide some analysis of the fits relative to a selected threshold. Second, taking the logarithm
of the data is a way to magnify the difference between the power-law and exponential fits, which
occurs mostly in the tail. It should be noted, however, that it does not affect the relative goodness of
the exponential and power-law fits, as can be verified by the results listed in Table V in Appendix A.

Downloaded from http://biorxiv.org/ on September 18, 2014

A. Generative Model

The observed power-law distributions for amplifications and deletions can be explained by a 
mechanism of preferential attachment. That is, once a region has large aberrations, it is more likely 
to acquire even more numerous large aberrations. One straightforward reason that could underlie this 
mechanism is that large amplifications or deletions lead to genomic instability and hence allow for 
subsequent large copy number aberrations. 
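
To make the idea of preferential attachment concrete, the following toy simulation may help. It is a sketch of a classical Simon-type growth process, swapped in here purely for illustration; the paper does not specify its generative model at this level of detail, and the names `simulate_preferential_growth` and `p_new` are hypothetical. At each step either a new segment of unit length appears, or an existing segment grows by one unit, chosen with probability proportional to its current length.

```python
import random


def simulate_preferential_growth(steps, p_new=0.1, seed=0):
    """Toy 'rich get richer' process over segment lengths (illustrative only)."""
    rng = random.Random(seed)
    lengths = [1]   # lengths[i] is the current length of segment i
    units = [0]     # one entry per unit of length, holding its segment index
    for _ in range(steps):
        if rng.random() < p_new:
            # A new aberration of minimal length appears.
            lengths.append(1)
            units.append(len(lengths) - 1)
        else:
            # Picking a random unit selects a segment proportionally to its length.
            i = rng.choice(units)
            lengths[i] += 1
            units.append(i)
    return lengths
```

Processes of this kind are known to produce heavy-tailed length distributions, which is the qualitative behaviour the conjecture appeals to; the sketch makes no claim about the biological rates involved.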

III. Improving tools through More Accurate Statistical Null-Models 

Most of the tools that are developed to analyze genomic data assume a non-informative exponential
null-model for segment length distribution (e.g., segmenters [3] and tools for detecting cancer genes
[9]). Knowledge of the fact that segment lengths are not exponentially distributed allows us to improve
our null models and hence our tools. The resulting prior is especially important when there is not
enough data available to accurately estimate null-models from the data. In the next section
we show how an existing tool for detecting cancer genes can be improved.

A. Statistical Method for Detecting Cancer Genes 

In this section we adapt a method described in [9] for finding cancer driver genes from copy number
variation data by building upon the assumption that segment lengths are power-law distributed.

Cancer genes are generally divided into two types: tumor suppressor genes (TSGs) and oncogenes
(OGs). TSGs prevent tumor development by regulating cell growth. A loss or reduction of their function
(for example by a deletion) can lead to uncontrolled cell division and allows the cancer to progress.
Oncogenes, on the other hand, are genes whose function promotes proliferation. Gain-of-function
mutations (like amplifications), or overexpression, promote tumor progression. In the case of TSGs,
a deletion of a part of the gene will cause a loss of function, while for OGs the whole gene needs
to be amplified to cause a gain of function.

The algorithm for finding TSGs and OGs enumerates all possible intervals and assigns to each of them
a score that measures the likelihood of the interval being a driver gene. This score function can be
described as follows:

For any interval $I$, the strength of the association between deletions in $I$ or amplifications of $I$ and
the disease is quantified by analyzing the genomic data for many individuals with a specific type of
cancer. For this purpose, a metric called Relative Risk ($RR_{\text{event } I}$) assigns a numerical value to any
event (a deletion or amplification of an interval) by comparing the probability of the disease
occurring with or without the event. Informally, $RR_{\text{event } I}$ is the degree to which the occurrence of
event $I$ raises the probability of the disease incidence. Formally,

$$
\begin{aligned}
RR_{\text{event } I}
&= \ln \frac{P(\text{disease} \mid \text{event } I)}{P(\text{disease} \mid \text{NOT event } I)} \\
&= \ln \frac{P(\text{event } I \mid \text{disease})\; P(\text{NOT event } I)}{P(\text{NOT event } I \mid \text{disease})\; P(\text{event } I)} \\
&= \ln \frac{P(\text{event } I \mid \text{disease})}{P(\text{NOT event } I \mid \text{disease})}
 + \left( -\ln \frac{P(\text{event } I)}{P(\text{NOT event } I)} \right), \qquad (1)
\end{aligned}
$$

where, in case of a deletion, "event $I$" denotes the event that at least part of $I$ is deleted. We call
this event '$I$ broken'. In case of an amplification, "event $I$" denotes the event that there exists an
amplified interval that fully includes $I$. We call this event '$I$ increased'.

The first term in equation (1) can be computed from the available tumor samples:

$$
\frac{P(\text{event } I \mid \text{disease})}{P(\text{NOT event } I \mid \text{disease})}
= \frac{n_{\text{event } I}}{n_{\text{NOT event } I}},
$$

where $n_{\text{event } I}$ (or $n_{\text{NOT event } I}$) is the number of patients in whose tumor genomes the event $I$ occurs
(or does not occur). Note that, because of the intrinsic differences between TSGs and OGs, in case of
deletions the longer the segment, the larger the ratio $n_{\text{event } I} / n_{\text{NOT event } I}$, whereas in case of
amplifications the situation is reversed: longer segments have a smaller ratio
$n_{\text{event } I} / n_{\text{NOT event } I}$. This imbalance is corrected for by the second part of (1),

$$
-\ln \frac{P(\text{event } I)}{P(\text{NOT event } I)},
$$

which incorporates prior information inherent in the statistical distribution of amplifications and
deletions.
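
As a small illustration of how the two terms of equation (1) combine, consider the sketch below. The function name `relative_risk` and its argument names are hypothetical, and the guard against zero counts is an added assumption (the text does not say how such cases are treated).

```python
import math


def relative_risk(n_event, n_not_event, p_event):
    """RR score: ln(n_event / n_not_event) - ln(p_event / (1 - p_event)).

    n_event, n_not_event: numbers of patients with and without the event.
    p_event: prior probability of the event under the null model.
    """
    if n_event <= 0 or n_not_event <= 0:
        raise ValueError("both patient counts must be positive")
    if not 0.0 < p_event < 1.0:
        raise ValueError("p_event must lie strictly between 0 and 1")
    data_term = math.log(n_event / n_not_event)
    prior_term = -math.log(p_event / (1.0 - p_event))
    return data_term + prior_term
```

For instance, an event seen in 30 of 40 patients with a prior probability of 0.25 scores ln 3 + ln 3: the data term and the prior correction both point in the same direction, since the event is frequent in patients but unlikely under the null model.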

To compute the prior score, we assume that, at any genomic location, a breakpoint (starting
point) may occur as a Poisson process at a rate of $\mu > 0$. We consider two different $\mu$'s: one for
amplifications, $\mu_{AMP}$, and the other for deletions, $\mu_{DEL}$, but we drop the subscript when no confusion
arises. Segments are modeled as vectors, starting at a breakpoint and moving left (or right) with
probability $\frac{1}{2}$. The length $t$ of each segment is distributed according to a power-law distribution,
$t^{-\alpha}$, with $1 < \alpha < 2$. Let $\epsilon$ be the constant that represents the shortest length an interval could
possibly have.

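Given these assumptions we can derive the prior probability that an interval $I$ is amplified or
deleted.

Proposition III.1. Assuming that segment lengths are power-law distributed:
1) The probability that an interval $I = [a, b]$ is broken is as follows:

$$
P([a, b] \text{ broken}) = 1 - e^{-\mu (b-a)}
\times e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]}
\times e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{(G-b)^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]}
$$

2) The probability that an interval $I = [a, b]$ is increased is as follows:

$$
P([a, b] \text{ increased}) = 1
- e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{b^{2-\alpha} - (b-a+\epsilon)^{2-\alpha}}{2-\alpha} \right]}
\times e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{(G-a)^{2-\alpha} - (b-a+\epsilon)^{2-\alpha}}{2-\alpha} \right]}
$$

where $[0, G]$ represents the region of interest (e.g. a chromosome) and $[a, b]$ is an interval within this
region. It is assumed that $\epsilon \ll G$.

□

The proof of this proposition can be found in Appendix B.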



The parameter $\alpha$ can be estimated from the data as described in Section II. The values of the
$\mu_{AMP}$ and $\mu_{DEL}$ parameters are the mean numbers of amplifications and deletions per unit length,
respectively, and can be computed directly from the segmented data.

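The constant $\epsilon$ can take any value. If we assume the value of $\epsilon$ is 1 unit (corresponding to a single
probe in microarray data or a single base in sequencing data) the probability that a segment is broken
approaches:

$$
P([a, b] \text{ broken}) = 1 - e^{-\mu (b-a)}
\times e^{-\frac{\mu}{2} \left[ \frac{a^{2-\alpha} - 1}{2-\alpha} \right]}
\times e^{-\frac{\mu}{2} \left[ \frac{(G-b)^{2-\alpha} - 1}{2-\alpha} \right]}
$$

Similarly for amplifications:

$$
P([a, b] \text{ increased}) = 1
- e^{-\frac{\mu}{2} \left[ \frac{b^{2-\alpha} - (b-a)^{2-\alpha}}{2-\alpha} \right]}
\times e^{-\frac{\mu}{2} \left[ \frac{(G-a)^{2-\alpha} - (b-a)^{2-\alpha}}{2-\alpha} \right]}
$$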
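
The prior probabilities of Proposition III.1 translate directly into code. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, positions are expressed in the same unit as $\epsilon$, and the general-$\epsilon$ form of the Proposition is used (including its $b - a + \epsilon$ term) rather than the $\epsilon = 1$ approximation.

```python
import math


def _tail_term(length, alpha, eps):
    """(length^(2-alpha) - eps^(2-alpha)) / (2 - alpha), as in the Proposition."""
    return (length ** (2.0 - alpha) - eps ** (2.0 - alpha)) / (2.0 - alpha)


def p_broken(a, b, G, mu, alpha, eps=1.0):
    """Prior probability that at least part of [a, b] within [0, G] is deleted."""
    scale = mu * eps ** (alpha - 1.0) / 2.0
    p_not_broken = (
        math.exp(-mu * (b - a))                              # no deletion starts inside [a, b]
        * math.exp(-scale * _tail_term(a, alpha, eps))       # deletions starting in [0, a]
        * math.exp(-scale * _tail_term(G - b, alpha, eps))   # deletions starting in [b, G]
    )
    return 1.0 - p_not_broken


def p_increased(a, b, G, mu, alpha, eps=1.0):
    """Prior probability that some amplified interval fully includes [a, b]."""
    scale = mu * eps ** (alpha - 1.0) / 2.0
    inner = (b - a + eps) ** (2.0 - alpha)
    left = (b ** (2.0 - alpha) - inner) / (2.0 - alpha)          # amplifications from [0, a]
    right = ((G - a) ** (2.0 - alpha) - inner) / (2.0 - alpha)   # amplifications from [b, G]
    return 1.0 - math.exp(-scale * left) * math.exp(-scale * right)
```

Consistent with the discussion of equation (1), `p_broken` grows with the length of the interval while `p_increased` shrinks, which is what lets the prior term offset the length bias of the patient counts.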



The score can be used to estimate the location of tumor suppressor genes and oncogenes.
The simplest algorithm first computes the score for all intervals with length in a range determined
by lower and upper bounds, and then picks the highest scoring interval on each chromosome. Many
other algorithms can be imagined. For example, one can use two scoring functions to compute the
left and right boundaries of the interval separately. The final step of the algorithm is significance
testing of the obtained intervals. The methods described in [9] for tumor suppressor genes, and
their counterpart for oncogenes, can be directly applied. Both methods assign a p-value to every
putative TSG or oncogene using tools from scan statistics.

We have implemented the algorithm by computing the RR score for each interval while keeping 
track of the highest scoring interval. Because each interval needs to be visited only once the time 
complexity is linear in the number of intervals. 
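
The single-pass scan described above can be sketched as follows. The helper names, the representation of an interval as a `(start, end)` pair of marker positions, and the length bounds `min_len` and `max_len` are assumptions made for the example; `score` stands for any scoring function, such as the RR score.

```python
def candidate_intervals(positions, min_len, max_len):
    """Yield every (start, end) pair of marker positions with min_len <= length <= max_len."""
    positions = sorted(positions)
    for i, start in enumerate(positions):
        for end in positions[i + 1:]:
            length = end - start
            if length > max_len:
                break  # positions are sorted, so later ends are even longer
            if length >= min_len:
                yield (start, end)


def best_interval(intervals, score):
    """Visit each interval once and keep the highest scoring one."""
    best, best_score = None, float("-inf")
    for interval in intervals:
        s = score(interval)
        if s > best_score:
            best, best_score = interval, s
    return best, best_score
```

Because `best_interval` touches each candidate exactly once, its running time is linear in the number of intervals, provided each score evaluation takes constant time.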

Instead of finding only the interval with maximum score on each chromosome we can let the 
algorithm pick higher scoring intervals. One straightforward way is to pick the n non-overlapping 
significantly amplified/deleted intervals with the highest score, by keeping track of a list of results 
while going through the set of all intervals. This method has certain shortcomings as described in 
the discussion section. 
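
One simple way to realize such a selection is a greedy pass over the intervals in order of decreasing score. This is a simplification chosen for illustration, not necessarily the bookkeeping used in the authors' implementation; the function name and the `(score, (start, end))` input format are hypothetical, and significance filtering is assumed to have happened beforehand.

```python
def top_n_non_overlapping(scored_intervals, n):
    """Greedily pick up to n mutually non-overlapping intervals with the highest scores.

    scored_intervals: iterable of (score, (start, end)) with closed intervals.
    """
    chosen = []
    ranked = sorted(scored_intervals, key=lambda item: item[0], reverse=True)
    for score, (start, end) in ranked:
        # Two closed intervals are disjoint if one ends before the other starts.
        if all(end < s or start > e for _, (s, e) in chosen):
            chosen.append((score, (start, end)))
            if len(chosen) == n:
                break
    return chosen
```

A greedy choice of this kind tends to return intervals clustered around the strongest signal, which matches the shortcoming noted in the text.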

B. Performance Comparison 

To test the influence of the improved null model, we have applied the algorithm described above
with both the original exponential and the power-law null models to the three TCGA
datasets: OV, LUSC and GBM.

To compare the two models we asked which of the commonly amplified or deleted genes in the
three cancer types were found by the respective algorithms. The results are summarized in Table II.
Consistent with our expectation, the power-law based model performs (slightly) better than the
exponential model.

Note that despite the (slightly) better performance of the algorithm with the power-law null model
over the exponential model, the two performances are comparable and both
algorithms appear to miss many cancer genes. Both methods can be further improved by including
additional information (e.g., gene-ontologies, gene-networks or pathways). In such a setting, as well
as when regions for many more genes are checked, the contribution from a more accurate null model
is expected to be more pronounced.

Cancer   Gene       Power-law   Exponential
OV       BRCA1      no          no
         BRCA2      no          no
         ERBB2      no          no
         K-ras      yes         yes
         AKT2       no          no
         PIK3CA     no          no
         c-MYC      next        no
         p53        no          no
LUSC     CDKN2A     yes         yes
         FGFR1      yes         no
         PDGFRA     no          no
         SOX2       no          no
         WHSC1L1    next        no
GBM      EGFR       next        next
         MDM2       no          no
         PDGFR      no          no
         CDK4       no          no
         Rb         no          no
         CDKN2A     yes         yes

TABLE II: List of genes that are commonly altered in OV, LUSC and GBM cancer cells, and
whether or not they were found by the power-law and exponential methods using the three highest
scoring non-overlapping intervals. A more detailed version of this table can be found in Table VIII
in Appendix C.

We offer several explanations for the missing genes. For example, the algorithm only picks out a
few (in this case three) high scoring intervals per chromosome. Often, these intervals are in the same
region close to a single gene, which causes other regions of interest to be overlooked. For example,
in the OV dataset, all three deleted intervals that were found on chromosome 17 were close to (but
not exactly overlapping with) BRCA1. It therefore became impossible to find p53, which also lies
on chromosome 17. This problem can be resolved by adopting more sophisticated statistical
methods for selecting high-scoring intervals.

In addition, regions either right next to actual genes or close to the centromere were often identified
as likely cancer genes. We expect this type of error to disappear as methods for CNV data collection
become more precise. In the next section, we briefly mention several other possible ways to improve
the method for finding driver genes.

IV. Conclusions and Discussion 

In summary, we have provided evidence suggesting that the segment lengths of CNV amplifications
and deletions in cancer cells follow a power-law distribution instead of the commonly assumed
exponential distribution. This evidence suggests a generative mechanism of preferential attachment:
many long amplifications and deletions lead to even more long amplifications and deletions. Even
though our data analysis rules out exponentially distributed segment lengths, and the evidence for
a power-law distribution is compelling, other distributions (such as log-normal or stretched exponential;
see Table VI in Appendix A) cannot be completely excluded on the basis of this evidence.

Especially in cases where only a small sample of data is available to estimate the prior distribution
from the data, knowledge about the statistics of CNV data allows us to improve our analytic tools.
As an example, we have demonstrated how the technique for finding cancer driver genes described in
[9] can be modified to incorporate the power-law distribution and, as our preliminary results indicate,
how the power-law-based scan-statistics algorithm outperforms the exponential one. Once inferred,
the set of cancer driver genes can be used as input to cancer progression extraction algorithms to
derive progression models from static cancer patient data (see e.g., [5], [6]), leading to improved
diagnostics, prognostics, and targeted therapies.

We note in conclusion that, despite their promise, these results represent an analysis that remains
largely preliminary in nature. More recent single-cell single-molecule genomic data have shed light
on the significant heterogeneity and temporality that exist in cancer progression: a tumor
consists of a heterogeneous population of cell-types, and the cells of different cell-types interact
dynamically while going through rapidly-changing cell-states. Thus, more sophisticated oncogenomic
analysis tools will need to generalize the mathematics described here much further, so that the null
model includes a mixture of distributions, with the parameters of the distributions fluctuating as
cancer progresses. Consequently, the tool to find cancer driver genes can be further improved in several
ways. For example, we will need to incorporate a preferential attachment model into the segmenter
that analyzes the genomic data from each cell-type; use more accurate priors of the distribution
of breakpoints that are known to occur in different cell-types; apply more sophisticated statistical
tools for picking high-scoring intervals by incorporating prior biological knowledge (carefully, so as
to avoid Bayesian bias); and include such information (i.e., how pathways affect the cell-states) in
combination with precise correction for multiple hypothesis testing in order to make the final results
more meaningful. But, to keep the focus on just the algorithmic/mathematical nature of this problem,
the formulation developed here has been kept rudimentary; thus, a more practical description of a
complete solution has remained outside the scope of this paper.

Appendix A
Segment-length distribution

Let $AVG_C$ and $STD_C$ (resp. $AVG_N$ and $STD_N$) denote the average segment-length and the
standard deviation of all segments derived from tumor (resp. blood-derived normal) cells.



[Figure 2: panels titled LUSC amp pl, LUSC amp exp, LUSC del pl, LUSC del exp; x-axis of each panel:
Segment-length bin number.]

Fig. 2: Segment length distribution and fitted functions for all three datasets: LUSC, OV, GBM. The
thresholds are $AVG_C \pm 2STD_C$.



[Figure 3: panels titled OV normals pl and OV normals exp; x-axis of each panel: Segment-length bin
number, with ticks at 1, 10 and 100.]

Fig. 3: Segment length distribution and fitted functions for OV 'Normals'. That is, all segments with
segment values in $[AVG_C - 2STD_C, AVG_C + 2STD_C]$.





            best exponential fit        best power-law fit
            function        R^2         function        R^2
LUSC Nrm    e^{-0.020x}     0.70        x^{-1.30}       0.57
OV Nrm      e^{-0.033x}     0.89        x^{-2.65}       0.92
GBM Nrm     e^{-0.016x}     0.50        x^{-1.09}       0.43

TABLE III: Distribution fits of the 'Normals'.

                     OV AMP                                        OV DEL
Threshold            th      PL α    R^2     EXP λ    R^2          th      PL α    R^2     EXP λ    R^2
±1.0                 1.00    1.41    0.90    0.024    0.64         -1.00   1.20    0.85    0.015    0.52
AVG_C ± 2STD_C       0.76    1.39    0.91    0.014    0.67         -0.87   1.30    0.91    0.013    0.64
AVG_C ± 1.5STD_C     0.59    1.79    0.93    0.031    0.81         -0.66   1.82    0.93    0.028    0.75
AVG_C ± 1STD_C       0.36    1.94    0.93    0.030    0.84         -0.46   1.99    0.90    0.033    0.85
AVG_N ± 5STD_N       0.21    2.11    0.88    0.0251   0.85         -0.27   2.21    0.85    0.0343   0.87
AVG_N ± 3STD_N       0.11    1.85    0.85    0.0255   0.87         -0.18   1.99    0.86    0.0343   0.89
AVG_N ± 2STD_N       0.06    1.79    0.83    0.025    0.89         -0.13   1.96    0.86    0.034    0.90
0.0                  0.00    1.79    0.83    0.025    0.89         0.00    1.96    0.86    0.034    0.90

TABLE IV: Using the OV dataset, this table shows how different thresholds influence the power-law
and exponential fits. For each threshold, th is the threshold value, PL α and EXP λ are the fitted
power-law and exponential parameters, and each R^2 column gives the goodness of the preceding fit.




Fig. 4: OV deletions segment length distributions for different thresholds: 0, $AVG_C \pm 1STD_C = -0.46$,
and $-1$.




Fig. 5: Distribution of segment values of all segments (left), all positive segment values (middle) and 
negative segment values (right), on a log-log scale 

            best exponential fit        best power-law fit
            function        R^2         function           R^2
LUSC Amp    e^{-0.27x}      0.96        x^{-1.2?}          0.97
LUSC Del    e^{-0.48x}      0.94        x^{-1.58}          0.99
OV Amp      e^{-0.26x}      0.96        x^{-1.50}          0.99
OV Del      e^{-0.30x}      0.91        [illegible]        0.99
GBM Amp     e^{-1.51x}      1.00        [illegible]        1.00
GBM Del     e^{-0.41x}      0.91        x^{-1.39}          0.97

TABLE V: Exponential and power-law fits for non-log data.

            best exponential fit    best power-law fit    best log-normal fit
            R^2                     R^2                   R^2
LUSC Amp    0.65                    0.86                  0.82
LUSC Del    0.45                    0.79                  0.77
OV Amp      0.67                    0.91                  0.77
OV Del      0.64                    0.91                  0.86
GBM Amp     0.39                    0.71                  0.71
GBM Del     0.60                    0.78                  0.76

TABLE VI: Exponential ($c e^{-\lambda x}$), power-law ($c x^{-\alpha}$) and log-normal (with exponent term
$(\ln(x) - a)^2$) fits.

Appendix B
Proof of Proposition III.1

The Model: We assume that, at any genomic location, a breakpoint (starting point) may occur as
a Poisson process at a rate of $\mu > 0$. We consider two different $\mu$'s: one for amplifications, $\mu_{AMP}$, and
one for deletions, $\mu_{DEL}$, but we drop the subscript when no confusion arises. Segments are modeled
as vectors, starting at a breakpoint $x$ and moving left (or right) with probability $\frac{1}{2}$. The length $t$ of
each segment is distributed according to a power-law distribution, $t^{-\alpha}$, with $1 < \alpha < 2$. Let $\epsilon$ be the
constant that represents the shortest length an interval could possibly have.

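Proposition B.1. Assuming that segment lengths are power-law distributed:
1) The probability that an interval $I = [a, b]$ is broken is as follows:

$$
P([a, b] \text{ broken}) = 1 - e^{-\mu (b-a)}
\times e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]}
\times e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{(G-b)^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]}
$$

2) The probability that an interval $I = [a, b]$ is increased is as follows:

$$
P([a, b] \text{ increased}) = 1
- e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{b^{2-\alpha} - (b-a+\epsilon)^{2-\alpha}}{2-\alpha} \right]}
\times e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{(G-a)^{2-\alpha} - (b-a+\epsilon)^{2-\alpha}}{2-\alpha} \right]}
$$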


1) We wish to estimate the probability that the interval $[a, b]$ is 'broken'. This is the probability
that there exists a deleted interval $I$ that intersects with $[a, b]$:

$$
P([a, b] \text{ broken}) = P(\exists I : I \cap [a, b] \neq \emptyset \text{ and } I \text{ is deleted}).
$$

Instead, we compute $P([a, b] \text{ is NOT broken})$ by computing:

($P_1$) The probability that no deletion occurs starting in the interval $[a, b]$,

($P_2$) The probability that each deletion starting in $[0, a]$ does not overlap $[a, b]$, and

($P_3$) The probability that each deletion starting in $[b, G]$ does not overlap $[a, b]$.

It follows that $P([a, b] \text{ is NOT broken}) = P_1 \times P_2 \times P_3$. Thus,

$$
P([a, b] \text{ broken}) = 1 - P([a, b] \text{ is NOT broken}).
$$

($P_1$) $P(\text{no deletion starts in } [a, b]) = e^{-\mu (b-a)}$. This equation follows immediately from the
assumption that breakpoints are generated by a Poisson process. Note that we drop the
subscript DEL in $\mu_{DEL}$.

($P_2$) $P(\text{each interval starting in } [0, a] \text{ does not overlap with } [a, b])$ can be broken down as the
following infinite sum:

$$
\begin{aligned}
P(\text{each interval starting in } [0, a] &\text{ does not overlap with } [a, b]) = \\
& P(\text{no deletions start in } [0, a]) \\
&+ P(1 \text{ deletion starts in } [0, a]) \times P(\text{the deleted interval} \cap [a, b] = \emptyset) \\
&+ P(2 \text{ deletions start in } [0, a]) \times P(\text{both deleted intervals} \cap [a, b] = \emptyset) \\
&+ \ldots
\end{aligned}
$$

By the assumption that breakpoints are generated as a Poisson process,
the probability $P(n \text{ deletions start in } [0, a]) = (\mu a)^n \frac{e^{-\mu a}}{n!}$ for each $n$. The probability
$P(1 \text{ deleted interval} \cap [a, b] = \emptyset)$ can be computed as follows. From our model it follows that
$P(\text{deleted interval} \cap [a, b] = \emptyset \mid 1 \text{ deletion starts in } [0, a])$ is the probability that a deletion
starting at $x$ in the interval $[0, a]$ does not reach all the way to $a$:

$$
P(\text{deleted interval} \cap [a, b] = \emptyset \mid 1 \text{ deletion starts in } [0, a])
= \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{a} \left( \int_0^{a-\epsilon} \int_\epsilon^{a-x} c_{\epsilon,G}\, t^{-\alpha}\, dt\, dx + \epsilon \right),
$$

where:

- The constant $c_{\epsilon,G}$ depends on the lengths $\epsilon$ and $G$ and is computed below.

- The $\frac{1}{2}$'s take into account the possibility that the deletion moves left instead of
right.

- The last term $+\epsilon$ takes into account the possibility that the starting point of the deleted
interval is in $[a - \epsilon, a]$.

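The preceding equation can be simplified as follows:

$$
\begin{aligned}
P(\text{deleted interval} &\cap [a, b] = \emptyset \mid 1 \text{ deletion starts in } [0, a]) \\
&= \frac{1}{2} + \frac{1}{2a} \left( \int_0^{a-\epsilon} \int_\epsilon^{a-x} c_{\epsilon,G}\, t^{-\alpha}\, dt\, dx + \epsilon \right) \\
&= \frac{1}{2} + \frac{1}{2a} \left( \frac{c_{\epsilon,G}}{1-\alpha} \int_0^{a-\epsilon} \left( (a-x)^{1-\alpha} - \epsilon^{1-\alpha} \right) dx + \epsilon \right) \\
&= \frac{1}{2} + \frac{1}{2a} \left( \frac{c_{\epsilon,G}}{1-\alpha} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} - \epsilon^{1-\alpha} (a - \epsilon) \right] + \epsilon \right).
\end{aligned}
$$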



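There are a few points to make regarding this derivation:

- We ignore the integration constants, as they cancel each other out.

- Since $\alpha > 1$ and $c_{\epsilon,G}, a > 0$, the term $\frac{c_{\epsilon,G}}{2a(1-\alpha)}$ is negative. We thus need to show that
$\frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} - \epsilon^{1-\alpha} (a - \epsilon)$ is always negative to obtain a positive probability. This
follows from the mean-value theorem. Namely, for any function $f$ that is concave and increasing
the following holds:

$$
f(x) - f(x - \delta) < \delta f'(x - \delta).
$$

The function $f(x) = \frac{x^{2-\alpha}}{2-\alpha}$ is concave and increasing with $f'(x) = x^{1-\alpha}$. If we let $x = a$
and $\delta = a - \epsilon$ then $x - \delta = \epsilon$ and we have

$$
\frac{a^{2-\alpha}}{2-\alpha} - \frac{\epsilon^{2-\alpha}}{2-\alpha} < (a - \epsilon)\, \epsilon^{1-\alpha},
$$

from which it follows that $\frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} - \epsilon^{1-\alpha} (a - \epsilon)$ is negative.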

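The normalizing constant $c_{\epsilon,G}$ can be computed as follows. It has to be such that

$$
\int_\epsilon^G c_{\epsilon,G}\, t^{-\alpha}\, dt = 1.
$$

It follows that

$$
c_{\epsilon,G} = \left( \int_\epsilon^G t^{-\alpha}\, dt \right)^{-1} = \frac{1-\alpha}{G^{1-\alpha} - \epsilon^{1-\alpha}}.
$$

Since $\alpha > 1$ and $G \gg \epsilon$ this approaches

$$
\frac{1-\alpha}{-\epsilon^{1-\alpha}} = \frac{\alpha - 1}{\epsilon^{1-\alpha}}.
$$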



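Using $c_{\epsilon,G} = \frac{\alpha-1}{\epsilon^{1-\alpha}}$, we can simplify $P(\text{deleted interval} \cap [a, b] = \emptyset \mid 1 \text{ deletion starts in } [0, a])$
as follows:

$$
\begin{aligned}
&= \frac{1}{2} + \frac{1}{2a} \left( \frac{\alpha-1}{\epsilon^{1-\alpha}} \cdot \frac{1}{1-\alpha} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} - \epsilon^{1-\alpha} (a - \epsilon) \right] + \epsilon \right) \\
&= \frac{1}{2} + \frac{1}{2a} \left( -\epsilon^{\alpha-1} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right] + (a - \epsilon) + \epsilon \right) \\
&= 1 - \frac{\epsilon^{\alpha-1}}{2a} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right].
\end{aligned}
$$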



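Since deletions are assumed to be independent events that can overlap, it follows that
$P(n \text{ deleted intervals} \cap [a, b] = \emptyset \mid n \text{ deletions start in } [0, a]) = P(\text{deleted interval} \cap
[a, b] = \emptyset \mid 1 \text{ deletion starts in } [0, a])^n$.
Hence, we get the following series:

$$
P_2 = e^{-\mu a} + (\mu a) \frac{e^{-\mu a}}{1!} (1 - w) + (\mu a)^2 \frac{e^{-\mu a}}{2!} (1 - w)^2 + \ldots
\qquad \text{with } w = \frac{\epsilon^{\alpha-1}}{2a} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right].
$$

The series is the expansion of $e^{-\mu a} e^{\mu a (1-w)} = e^{-\mu a w}$. It follows that

$$
P_2 = e^{-\mu a \frac{\epsilon^{\alpha-1}}{2a} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]},
$$

which can be simplified to

$$
P_2 = e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]}.
$$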



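($P_3$) $P(\text{each interval starting in } [b, G] \text{ does not overlap with } [a, b])$ is computed in the same
way as $P_2$, but now starting at $x \in [b, G]$ and moving left. In this case

$$
\begin{aligned}
P(\text{deleted interval} &\cap [a, b] = \emptyset \mid 1 \text{ deletion starts in } [b, G]) \\
&= \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{G-b} \left( \int_{b+\epsilon}^{G} \int_\epsilon^{x-b} c_{\epsilon,G}\, t^{-\alpha}\, dt\, dx + \epsilon \right) \\
&= 1 - \frac{\epsilon^{\alpha-1}}{2(G-b)} \left[ \frac{(G-b)^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]
\end{aligned}
$$

and we obtain

$$
P_3 = e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{(G-b)^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]}.
$$

It follows that

$$
P([a, b] \text{ broken}) = 1 - e^{-\mu (b-a)}
\times e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{a^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]}
\times e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{(G-b)^{2-\alpha} - \epsilon^{2-\alpha}}{2-\alpha} \right]}.
$$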
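
The two key steps in the computation of $P_2$ can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the inner integral over $t$ is evaluated in closed form using the limiting constant $c_{\epsilon,G} = (\alpha-1)/\epsilon^{1-\alpha}$ (so the integrand becomes $1 - (\epsilon/(a-x))^{\alpha-1}$), and the outer integral over $x$ is approximated with a midpoint rule.

```python
import math


def no_overlap_numeric(a, alpha, eps, steps=100_000):
    """1/2 + (1/(2a)) * (outer integral + eps), outer integral by the midpoint rule."""
    h = (a - eps) / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        # Probability that a segment starting at x has length at most a - x.
        total += 1.0 - (eps / (a - x)) ** (alpha - 1.0)
    return 0.5 + (total * h + eps) / (2.0 * a)


def no_overlap_closed(a, alpha, eps):
    """Closed form 1 - w derived in the text."""
    w = eps ** (alpha - 1.0) / (2.0 * a) * (a ** (2.0 - alpha) - eps ** (2.0 - alpha)) / (2.0 - alpha)
    return 1.0 - w


def p2_series(mu, a, w, terms=80):
    """Truncated Poisson-weighted series for P_2."""
    return sum(
        math.exp(-mu * a) * (mu * a) ** n / math.factorial(n) * (1.0 - w) ** n
        for n in range(terms)
    )


def p2_closed(mu, a, w):
    """Closed form exp(-mu * a * w) of the series."""
    return math.exp(-mu * a * w)
```

With $a = 100$, $\alpha = 1.5$ and $\epsilon = 1$, the closed form gives $1 - w = 0.91$, and both the numerical integral and the truncated series agree with their closed forms to within numerical error.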



2) In an analogous fashion, we calculate the probability that the interval $[a, b]$ is 'increased'. This
is the probability that there exists an amplified interval $I$ that includes $[a, b]$:

$$
P([a, b] \text{ increased}) = P(\exists I : [a, b] \subseteq I \text{ and } I \text{ is amplified}).
$$

We compute $P([a, b] \text{ is NOT increased})$ by computing:

($P_1$) The probability that each interval starting in $[0, a]$ does not include $[a, b]$, and

($P_2$) The probability that each interval starting in $[b, G]$ does not include $[a, b]$.

The computation of $P_1$ (and $P_2$) is exactly like that of deletions, except for the fact that we have
to integrate over all intervals reaching up to $b$ (down to $a$). In the case of $P_1$, we solve

$$
P([a, b] \not\subseteq \text{amplified interval} \mid 1 \text{ amplification starts in } [0, a])
= 1 - \frac{\epsilon^{\alpha-1}}{2a} \left[ \frac{b^{2-\alpha} - (b-a+\epsilon)^{2-\alpha}}{2-\alpha} \right]
$$

and in the case of $P_2$

$$
P([a, b] \not\subseteq \text{amplified interval} \mid 1 \text{ amplification starts in } [b, G])
= 1 - \frac{\epsilon^{\alpha-1}}{2(G-b)} \left[ \frac{(G-a)^{2-\alpha} - (b-a+\epsilon)^{2-\alpha}}{2-\alpha} \right].
$$

We obtain:

$$
P([a, b] \text{ increased}) = 1
- e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{b^{2-\alpha} - (b-a+\epsilon)^{2-\alpha}}{2-\alpha} \right]}
\times e^{-\mu \frac{\epsilon^{\alpha-1}}{2} \left[ \frac{(G-a)^{2-\alpha} - (b-a+\epsilon)^{2-\alpha}}{2-\alpha} \right]}.
$$

Appendix C
Detecting driver genes

Cancer        amplifications   deletions   normals
OV (337)      13416            10237       82633
LUSC (201)    3637             1832        46215
GBM (299)     2131             3959        41458

TABLE VII: Number of deleted and amplified segments for three TCGA data sets using a threshold
of $AVG_C \pm 2STD_C$.



Cancer  Gene      Function  Location                     Power-law                      Exponential
OV      BRCA1     TSG       17: (41196312..41277500)     no                             no
        BRCA2     TSG       13: (32889617..32973809)     no                             no
        ERBB2     OG        17: (37844393..37884915)     no                             no
        K-ras     OG        12: (25358180..25403854)     yes (25289555..25421243)       yes (25177510..26726740)
        AKT2      OG        19: (40736224..40791302)     no                             no
        PIK3CA    OG        3: (178866311..178952500)    no                             no
        c-MYC     OG        8: (128748315..128753680)    next (128797789..128989029)    no
        p53       TSG       17: (7571720..7590868)       no                             no
LUSC    CDKN2A    TSG       9: (21967751..21994490)      yes (18947155..28723296)       yes (21983401..21993651)
        FGFR1     OG        8: (38268656..38326352)      yes (38303346..38369274)       no
        PDGFR     OG        4: (55095264..55164412)      no                             no
        SOX2      OG        3: (34650005..34652461)      no                             no
        WHSC1L1   OG        8: (38132560..38239790)      next (38303346..38369274)      no
GBM     EGFR      OG        7: (55086725..55275031)      next (55049021..55065490)      next (54998411..55043660)
        MDM2      OG        12: (69201971..69239320)     no                             no
        PDGFR     OG        5: (149493402..149535422)    no                             no
        CDK4      OG        12: (58141510..58146230)     no                             no
        Rb        TSG       13: (48877883..49056026)     no                             no
        p53       TSG       17: (7571720..7590868)       no                             no
        PTEN      TSG       10: (89623195..89728532)     no                             no
        CDKN2A    TSG       9: (21967751..21994490)      yes (21973069..21983401)       yes (21973069..21983401)

TABLE VIII: List of genes with their locations that are commonly altered in OV, LUSC and GBM cancer cells, and whether or not they were
found by the power-law and exponential methods using the three highest scoring non-overlapping intervals. The OV and GBM genes were taken
from the KEGG database (http://www.genome.jp/dbget-bin/www_bget?ds:H00027 and http://www.genome.jp/dbget-bin/www_bget?ds:H00042);
the LUSC genes, for which no KEGG entry exists, are commonly amplified/deleted LUSC driver genes from the literature (mentioned on page 519
of the cited reference). A gene is considered 'found' if the selected interval intersects with the region containing the gene. In this table
'next' indicates within 100 kbp from a border of the gene interval. The parameters $\mu$, $\alpha$, and $\lambda$ were estimated from the data as in [9]
and Table I, with the exception of $\alpha$ of LUSC Del, which was set to 1, as the computation of the RR score assumes $\alpha > 1$. Segments shorter
than $10^4$ base pairs (corresponding to the distance between two probes) and longer than $10^7$ base pairs were excluded.
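
The 'yes' / 'next' / 'no' labels used in Table VIII follow a simple rule that can be sketched as below. The function name `classify_hit` and the tuple representation of intervals are hypothetical; the 100 kbp margin is the one stated in the caption, and intervals are treated as closed.

```python
def classify_hit(gene, interval, margin=100_000):
    """Label a selected interval relative to a gene region on the same chromosome.

    gene, interval: (start, end) coordinates in base pairs.
    Returns 'yes' if the two intersect, 'next' if the gap between them is at
    most `margin` base pairs, and 'no' otherwise.
    """
    g_start, g_end = gene
    i_start, i_end = interval
    if i_start <= g_end and g_start <= i_end:
        return "yes"
    # The interval lies entirely on one side of the gene; measure the gap.
    gap = g_start - i_end if i_end < g_start else i_start - g_end
    return "next" if gap <= margin else "no"
```

Applied to the coordinates in the table, the K-ras interval found by the power-law method overlaps the gene ('yes'), while the c-MYC interval starts about 44 kbp past the end of the gene ('next').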



